Visual Question Answering (VQA) is the task of answering natural-language questions about images. We introduce the novel problem of determining the relevance of questions to images in VQA. Current VQA models do not reason about whether a question is even related to the given image (e.g. What is the capital of Argentina?) or if it requires information from external resources to answer correctly. This can break the continuity of a dialogue in human-machine interaction. Our approaches for determining relevance are composed of two stages. Given an image and a question, (1) we first determine whether the question is visual or not, (2) if visual, we determine whether the question is relevant to the given image or not. Our approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, are able to outperform strong baselines on both relevance tasks. We also present human studies showing that VQA models augmented with such question relevance reasoning are perceived as more intelligent, reasonable, and human-like.
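The sketch below illustrates the two-stage decision flow described in the abstract (first "is the question visual?", then "is it relevant to this image?"). It is only a minimal stand-in: the keyword test for visualness, the stopword list, the bag-of-words caption-question cosine similarity, and the 0.2 threshold are assumptions for illustration, not the paper's LSTM-RNN, VQA-uncertainty, or caption-similarity models.

```python
# Illustrative two-stage relevance check (stand-in, not the paper's models).
import math
import re
from collections import Counter

STOPWORDS = {"what", "is", "the", "a", "an", "in", "on", "of", "to", "this"}

def content_words(text: str) -> Counter:
    """Lowercased, punctuation-stripped content words as a bag-of-words vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(va: Counter, vb: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_visual(question: str) -> bool:
    """Stage 1 (stand-in): questions mentioning visual cue words are treated as visual."""
    visual_cues = {"color", "wearing", "picture", "image", "photo", "holding", "shown"}
    return bool(visual_cues & set(re.findall(r"[a-z]+", question.lower())))

def relevance_decision(question: str, image_caption: str, threshold: float = 0.2) -> str:
    """Stage 2 (stand-in): caption-question similarity decides relevance to the image."""
    if not is_visual(question):
        return "non-visual"
    sim = cosine(content_words(question), content_words(image_caption))
    return "relevant" if sim >= threshold else "irrelevant to this image"

caption = "A man riding a horse on a beach"
print(relevance_decision("What is the capital of Argentina?", caption))        # non-visual
print(relevance_decision("What color is the horse in the picture?", caption))  # relevant
print(relevance_decision("What color is the dog in the picture?", caption))    # irrelevant to this image
```

The caption-based second stage mirrors the abstract's caption-question similarity idea in the simplest possible form; in practice the paper's learned models would replace both hand-written stages.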